Analyzing NYC Traffic Collision Data

A 15-688 Project by:

Ahmet Emre Unal (ahmetemu)

Marco Peyrot (mpeyrotc)

New York City is a beautiful city with terrible traffic. The drivers are impatient and aggressive, which results in many traffic collisions every day, some of which, quite unfortunately, lead to injuries and death.

For our project, we wanted to understand NYC's most collision-prone areas and try to predict collisions based on many factors, such as location, vehicle type, whether it's a weekday, etc.

The NYPD Motor Vehicle Collisions dataset we found was surprisingly feature-rich and large. Its size also meant that many collision records had missing data.


In [1]:
import os
from os import path
from bs4 import BeautifulSoup
import collections
import numpy as np
import pandas as pd
import re
import requests 
import scipy
from scipy.cluster.vq import kmeans, kmeans2, vq
from scipy.spatial.distance import cdist
from IPython.core.display import display, HTML
from colour import Color
from sklearn import linear_model
import webbrowser
from geopy.distance import distance

# Plotting
import matplotlib
matplotlib.use('svg')
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import seaborn as sns
from PIL import Image

# For scikit-learn to chill out about deprecation
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

raw_data_file_name = 'NYPD_Motor_Vehicle_Collisions.csv'
bicycle_lanes_page_url = 'http://www.nyc.gov/html/dot/html/bicyclists/lane-list.shtml'
prime_colors = ['Red', 'Blue', 'Purple', 'Green', 'Yellow', 'Cyan', 'Magenta', 'Black', 'White']
current_working_dir = os.getcwd()
mapbox_access_token = 'pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpandmbXliNDBjZWd2M2x6bDk3c2ZtOTkifQ._QA7i5Mpkd_m30IGElHziw'

def get_map_output_url(output_file_name):
    return 'file://%s' % path.join(current_working_dir, output_file_name + '.html')

def create_map_html_file(output_file_name):
    """
    Creates a webpage that will hold the map where the points of each cluster are plotted. This page
    shows an instance of Leaflet (http://leafletjs.com/).

    Parameters:
        output_file_name (string): base name used for the page title, the generated
                                   HTML file, and the javascript file it references

    Returns:
        Nothing, the file is written onto disk
    """
    page = """
        <!DOCTYPE html>
        <html>
            <head>
                <title>TITLE</title>
                <meta charset="utf-8" />
                <meta name="viewport" content="width=device-width, initial-scale=1.0">

                <link rel="stylesheet" href="https://unpkg.com/leaflet@1.0.2/dist/leaflet.css" />
                <script src="https://unpkg.com/leaflet@1.0.2/dist/leaflet.js"></script>
            </head>
            <body>
                <div id="mapid" style="min-width: 800px; min-height: 600px;"></div>
                <script src="SCRIPT"></script>
            </body>
        </html>
    """

    page = page.replace("TITLE", output_file_name)
    page = page.replace("SCRIPT", output_file_name + '.js')
    
    with open(output_file_name + '.html', "w") as map_file:
        map_file.write(page)

def create_map_script_file(output_file_name, lat, lon, zoom):
    """
    Creates a javascript file with the basic content to display a map in a webpage.

    Parameters:
        output_file_name (string): the name (or directory) of the javascript file to be created
        lat (float): the latitude where the map will be centered
        lon (float): the longitude where the map will be centered
        zoom (int): the zoom level in which the map starts (can be changed in browser)

    Returns:
        Nothing, the file is written onto disk
    """
    
    script = """    
    var mymap = L.map('mapid').setView([LAT, LON], ZOOM);
    
    L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=ACC_TOKEN', {
        maxZoom: 18,
        attribution: 'Map data &copy; <a href="http://openstreetmap.org">OpenStreetMap</a> contributors, ' +
            '<a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, ' +
            'Imagery © <a href="http://mapbox.com">Mapbox</a>',
        id: 'mapbox.streets'
    }).addTo(mymap);
    """
    script = script.replace("LAT", str(lat))
    script = script.replace("LON", str(lon))
    script = script.replace("ZOOM", str(zoom))
    script = script.replace("ACC_TOKEN", mapbox_access_token)
    
    with open(output_file_name + '.js', "w") as script_file:
        script_file.write(script)
        
def append_points_to_script(output_file_name, points, color, size):
    """
    Appends to a javascript file all the points to be plotted with a given color and size

    Parameters:
        output_file_name (string): the name (or directory) of the javascript file to be used
        points (list): the list of points to be plotted, as a DataFrame
        color (Color): the color in which the points will be plotted
        size (int): the size of each point

    Returns:
        Nothing, the file is written onto disk
    """
    
    point_js = """
    L.circle([LAT, LON], SIZE, {
        color: 'COLOR',
        fillColor: 'COLOR',
        fillOpacity: 0.5
    }).addTo(mymap);
    """
    
    point_js = point_js.replace('SIZE', str(size))
    point_js = point_js.replace('COLOR', str(color.hex))
    
    if isinstance(points, pd.DataFrame):
        points = points.loc[:, ['LATITUDE', 'LONGITUDE']].values
        
    with open(output_file_name + '.js', "a") as script_file:
        for lat, lon in points:
            new_point_js = point_js.replace('LAT', str(lat))
            new_point_js = new_point_js.replace('LON', str(lon)) 
            script_file.write(new_point_js)

Note: the utility functions defined above are mainly used to plot the data employed in this notebook on an interactive map. Doing so required generating custom HTML and JavaScript.
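
For reference, here is a minimal usage sketch of these helpers. The output name 'demo_map' and the plotted point are arbitrary choices for illustration:


In [ ]:
# Minimal sketch: create an empty Leaflet map of NYC, plot one
# (arbitrary) point on it, then print the local URL of the result
create_map_html_file('demo_map')
create_map_script_file('demo_map', lat=40.77, lon=-74.009725, zoom=12)
append_points_to_script('demo_map', [(40.7564, -73.9845)], Color('Red'), 100)
print get_map_output_url('demo_map')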

We started by cleaning up rows with empty or 'UNKNOWN' values, throwing away about a third of the original dataset. We then dropped columns that were either unnecessary (like the contributing factor of the collision, e.g. the driver being distracted, which cannot be known before a collision happens) or too detailed (like the vehicle subtypes).


In [2]:
def load_data(file_name):
    collision = pd.read_csv(file_name, 
                            na_filter=False, 
                            parse_dates={'DATE_COMBINED' : ['DATE', 'TIME']}, 
                            infer_datetime_format=True)
    
    # Remove rows that don't have the necessary data
    columns_to_check_for_empty = ['LOCATION', 'LATITUDE', 'LONGITUDE', 'BOROUGH',
                                  'ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME', 
                                  'VEHICLE TYPE CODE 1']
    for column in columns_to_check_for_empty:
        collision = collision[collision[column] != '']
        collision = collision[collision[column] != 'UNKNOWN']

    # Drop unneeded columns
    columns_to_drop = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 
                       'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4', 
                       'CONTRIBUTING FACTOR VEHICLE 5', 'LOCATION', 'OFF STREET NAME', 
                       'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 
                       'VEHICLE TYPE CODE 5']
    for column in columns_to_drop:
        collision = collision.drop(column, axis=1)
        
    # Set column types
    collision['ZIP CODE'] = collision['ZIP CODE'].astype(int)
    collision['LATITUDE'] = collision['LATITUDE'].astype(float)
    collision['LONGITUDE'] = collision['LONGITUDE'].astype(float)
    
    # Rename date column to just 'DATE'
    collision = collision.rename(columns={'DATE_COMBINED':'DATE'})
    
    # Eliminate duplicates
    collision = collision.drop_duplicates()
    
    # Reset index
    collision = collision.reset_index(drop=True)
    
    return collision

We proceeded to create two temporal features:

  1. Whether it occurred on a weekday or weekend
  2. In what 'time period' of the day it occurred

The 'time period' requires some explanation: morning and evening rush hours can be more collision-prone than the middle of the day or the night (after the evening rush and before the morning rush). We decided to create four bins with the following time ranges:

  1. Night Time: 00:00-06:59 & 20:00-23:59
  2. Morning Rush: 07:00-10:59
  3. Day Time: 11:00-15:59
  4. Evening Rush: 16:00-19:59

Together with the weekday/weekend feature, this allowed us to represent the date in a way that is meaningful to a machine learning algorithm. We also assumed that, as long as it's a weekday, the exact day (whether it's a Monday or a Thursday) is not significant.


In [3]:
def create_temporal_features(collision):
    collision = _create_time_features(collision)
    collision = _create_date_features(collision)
    
    # We're done with date, we can drop it
    collision = collision.drop('DATE', axis=1)
    
    return collision
    

def _create_time_features(collision):
    # Create one-hot time of day representation
    ## Date part is unimportant
    morning_rush_begin = pd.datetime(2000, 1, 1, 7, 0, 0).time()
    morning_rush_end = pd.datetime(2000, 1, 1, 11, 0, 0).time()
    evening_rush_begin = pd.datetime(2000, 1, 1, 16, 0, 0).time()
    evening_rush_end = pd.datetime(2000, 1, 1, 20, 0, 0).time()
    collision_time = collision['DATE'].dt.time
    
    ## Night Time 00:00-06:59 & 20:00-00:00
    night_time = (collision_time >= evening_rush_end) | (collision_time < morning_rush_begin)
    night_time_onehot = pd.get_dummies(night_time).loc[:, True].astype(int)
    collision = collision.assign(NIGHT_TIME = night_time_onehot.values)
    
    ## Morning Rush 07:00-10:59
    morning_rush = (collision_time >= morning_rush_begin) & (collision_time < morning_rush_end)
    morning_rush_onehot = pd.get_dummies(morning_rush).loc[:, True].astype(int)
    collision = collision.assign(MORNING_RUSH = morning_rush_onehot.values)
    
    ## Day Time 11:00-15:59
    day_time = (collision_time >= morning_rush_end) & (collision_time < evening_rush_begin)
    day_time_onehot = pd.get_dummies(day_time).loc[:, True].astype(int)
    collision = collision.assign(DAY_TIME = day_time_onehot.values)
    
    ## Evening Rush 16:00-19:59
    evening_rush = (collision_time >= evening_rush_begin) & (collision_time < evening_rush_end)
    evening_rush_onehot = pd.get_dummies(evening_rush).loc[:, True].astype(int)
    collision = collision.assign(EVENING_RUSH = evening_rush_onehot.values)
    
    return collision

def _create_date_features(collision):
    # Create one-hot weekday/weekend representation
    collision_day = collision['DATE'].dt.dayofweek
    
    ## Weekday 0, 1, 2, 3, 4
    ## Weekend 5, 6
    is_weekday = (collision_day <= 4)
    is_weekday_onehot = pd.get_dummies(is_weekday).astype(int)
    
    ## Weekday
    weekday_onehot = is_weekday_onehot.loc[:, True]
    collision = collision.assign(WEEKDAY = weekday_onehot.values)

    ## Weekend
    weekend_onehot = is_weekday_onehot.loc[:, False]
    collision = collision.assign(WEEKEND = weekend_onehot.values)
    
    return collision

We then created a one-hot encoding of the vehicle types available in the dataset. The NYPD puts almost half of the vehicles in a general group called 'PASSENGER VEHICLE', which seems to represent the common sedan. Other categories, like SUVs, motorcycles, and small and large commercial vehicles, have their own codes.
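
A quick way to check this claim is to tabulate the raw vehicle type column. This check is ours and reloads the data, since the function below drops the column:


In [ ]:
# Tabulate the ten most common vehicle types in the cleaned data
cleaned = load_data(raw_data_file_name)
print cleaned['VEHICLE TYPE CODE 1'].value_counts().head(10)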

We removed collisions involving vehicle types with very few occurrences (like 'AMBULANCE'). This cut a further 5% or so of the dataset.


In [4]:
def create_vehicle_features(collision):
    # Create one-hot vehicle type representation
    vehicle_types_onehot = pd.get_dummies(collision.loc[:, 'VEHICLE TYPE CODE 1']).astype(int)
    
    # Merge Motorcycle & Scooter columns
    motorcycle = vehicle_types_onehot.loc[:, 'MOTORCYCLE'] + vehicle_types_onehot.loc[:, 'SCOOTER']
    vehicle_types_onehot = vehicle_types_onehot.drop('MOTORCYCLE', axis=1)
    vehicle_types_onehot = vehicle_types_onehot.drop('SCOOTER', axis=1)
    vehicle_types_onehot = vehicle_types_onehot.assign(MOTORCYCLE = motorcycle.values)
    
    # Concatenate one-hot with collisions
    collision = pd.concat([collision, vehicle_types_onehot], axis=1)
    
    # Drop collisions involving unneeded vehicle types
    vehicles_to_drop = ['OTHER', 'AMBULANCE', 'PEDICAB', 'FIRE TRUCK', 'LIVERY VEHICLE']
    collisions_to_keep = (vehicle_types_onehot.loc[:, vehicles_to_drop].sum(axis=1) == 0)
    collision = collision[collisions_to_keep]
    collision = collision.reset_index(drop=True)  # Reset index due to dropped rows
    
    # Drop unneeded vehicle columns
    for column in vehicles_to_drop:
        collision = collision.drop(column, axis=1)
    
    # Drop vehicle type column
    collision = collision.drop('VEHICLE TYPE CODE 1', axis=1)
        
    return collision

We proceeded by assigning a 'severity score' to each collision. Each collision has a default severity score of 1; the score increases by 2 for each injury and by 10 for each death, so a collision with two injuries and no deaths scores 1 + 2×2 = 5. These are arbitrary weights we came up with; they are not meant to put a price on injury or loss of life.


In [5]:
# Default score values
INJURY_SCORE = 2
DEATH_SCORE = 10

def create_severity_score_feature(collision):
    # Give all collisions a score of 1 by default
    collision['SEVERITY SCORE'] = 1
    
    # For each injury, add to the score of the collision
    collision['SEVERITY SCORE'] += (collision.loc[:, 'NUMBER OF PERSONS INJURED'] * INJURY_SCORE)
    
    # For each death, add to the score of the collision
    collision['SEVERITY SCORE'] += (collision.loc[:, 'NUMBER OF PERSONS KILLED'] * DEATH_SCORE)
    
    return collision

We then load the data and apply the feature transformations defined above:


In [6]:
# Load the collision data
collision = load_data(raw_data_file_name)
collision = create_temporal_features(collision)
collision = create_vehicle_features(collision)
collision = create_severity_score_feature(collision)

print collision.head()
print collision.dtypes
print len(collision)


     BOROUGH  ZIP CODE   LATITUDE  LONGITUDE     ON STREET NAME  \
0  MANHATTAN     10075  40.775640 -73.960511     EAST 79 STREET   
1     QUEENS     11420  40.681260 -73.818679         122 STREET   
2   BROOKLYN     11217  40.684052 -73.977458    ATLANTIC AVENUE   
3      BRONX     10472  40.832281 -73.871304       CROES AVENUE   
4     QUEENS     11357  40.786854 -73.822349  PARSONS BOULEVARD   

  CROSS STREET NAME  NUMBER OF PERSONS INJURED  NUMBER OF PERSONS KILLED  \
0       PARK AVENUE                          0                         0   
1        111 AVENUE                          0                         0   
2   FLATBUSH AVENUE                          0                         0   
3   EAST 172 STREET                          1                         0   
4         14 AVENUE                          2                         0   

   NUMBER OF PEDESTRIANS INJURED  NUMBER OF PEDESTRIANS KILLED  \
0                              0                             0   
1                              0                             0   
2                              0                             0   
3                              0                             0   
4                              0                             0   

        ...        BUS  LARGE COM VEH(6 OR MORE TIRES)  PASSENGER VEHICLE  \
0       ...          0                               0                  1   
1       ...          0                               0                  0   
2       ...          0                               0                  1   
3       ...          0                               0                  1   
4       ...          0                               0                  1   

   PICK-UP TRUCK  SMALL COM VEH(4 TIRES)  SPORT UTILITY / STATION WAGON  TAXI  \
0              0                       0                              0     0   
1              0                       0                              1     0   
2              0                       0                              0     0   
3              0                       0                              0     0   
4              0                       0                              0     0   

   VAN  MOTORCYCLE  SEVERITY SCORE  
0    0           0               1  
1    0           0               1  
2    0           0               1  
3    0           0               3  
4    0           0               5  

[5 rows x 32 columns]
BOROUGH                            object
ZIP CODE                            int64
LATITUDE                          float64
LONGITUDE                         float64
ON STREET NAME                     object
CROSS STREET NAME                  object
NUMBER OF PERSONS INJURED           int64
NUMBER OF PERSONS KILLED            int64
NUMBER OF PEDESTRIANS INJURED       int64
NUMBER OF PEDESTRIANS KILLED        int64
NUMBER OF CYCLIST INJURED           int64
NUMBER OF CYCLIST KILLED            int64
NUMBER OF MOTORIST INJURED          int64
NUMBER OF MOTORIST KILLED           int64
UNIQUE KEY                          int64
NIGHT_TIME                          int64
MORNING_RUSH                        int64
DAY_TIME                            int64
EVENING_RUSH                        int64
WEEKDAY                             int64
WEEKEND                             int64
BICYCLE                             int64
BUS                                 int64
LARGE COM VEH(6 OR MORE TIRES)      int64
PASSENGER VEHICLE                   int64
PICK-UP TRUCK                       int64
SMALL COM VEH(4 TIRES)              int64
SPORT UTILITY / STATION WAGON       int64
TAXI                                int64
VAN                                 int64
MOTORCYCLE                          int64
SEVERITY SCORE                      int64
dtype: object
577056
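
Before moving on, a quick sanity check of ours: how the cleaned collisions are distributed across boroughs (seaborn is already imported above):


In [ ]:
# Collision counts per borough in the cleaned data
sns.countplot(y='BOROUGH', data=collision)
plt.xlabel('Number of collisions')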

We proceed by parsing NYC's bike lane list web page to add another feature to our collision data: whether the street the collision occurred on has a bicycle lane. We do this by parsing the bicycle lanes page and matching the lane streets to the collisions' street names.


In [7]:
def get_bicycle_lanes_from_page(page_html):
    """
    Parse the table contained in the bicycle lanes webpage.

    Args:
        page_html (string): String of HTML corresponding to the data source webpage.

    Returns:
        a dictionary that contains mappings from a category to the list containing the data.
        These categories are: street, begins, ends, and borough.
    """
    soup = BeautifulSoup(page_html, 'html.parser')
    bicycle_lanes_list = {
        'street': [],
        'begins': [],
        'ends': [],
        'borough': []
    }

    table = soup.findChildren('tbody')[0]
    rows = table.findChildren(['tr'])
    
    for row in rows:
        cells = row.findChildren('td')
        content = cells[0].text
        m = re.search(r'^([a-zA-Z\s0-9\-\.,\(\)]+) (from)*(between)* '
                      r'([a-zA-Z\s0-9\-\.,\(\)]+) (to)*(and)* '
                      r'([a-zA-Z\s0-9\-\.,\(\)]+)$', content)
        
        # Content that does not follow this syntax is discarded because
        # it refers to landscapes or parks.
        if m is not None:
            bicycle_lanes_list['street'].append(m.group(1).upper())
            bicycle_lanes_list['begins'].append(m.group(4).upper())
            bicycle_lanes_list['ends'].append(m.group(7).upper())
            bicycle_lanes_list['borough'].append(cells[2].text.upper())

    return bicycle_lanes_list
    
def extract_bicycle_lanes(url):
    """
    Retrieve all of the bicycle lane information for the city of New York.

    Parameters:
        url (string): page URL corresponding to the listing of bicycle lane information.

    Returns:
        bicycle_lanes (Pandas DataFrame): a DataFrame containing the extracted lane information.
    """
    bicycle_lanes_page = requests.get(url).text
    bicycle_lanes_list = get_bicycle_lanes_from_page(bicycle_lanes_page)
            
    bicycle_lanes = pd.DataFrame(bicycle_lanes_list)
    bicycle_lanes.loc[bicycle_lanes.borough == 'THE BRONX', 'borough'] = 'BRONX'
    columns_to_rename = {'borough': 'BOROUGH', 'begins': 'BEGINS', 'ends': 'ENDS', 'street': 'ON STREET NAME'}
    bicycle_lanes = bicycle_lanes.rename(columns=columns_to_rename)
    
    return bicycle_lanes

In [8]:
bicycle_lanes = extract_bicycle_lanes(bicycle_lanes_page_url)

print bicycle_lanes.head()
print len(bicycle_lanes)


               BEGINS BOROUGH              ENDS      ON STREET NAME
0        BOUCK AVENUE   BRONX  KINGSLAND AVENUE     ALLERTON AVENUE
1    BRONXWOOD AVENUE   BRONX     STEDMAN PLACE     ALLERTON AVENUE
2   BAYCHESTER AVENUE   BRONX     MALL ENTRANCE       BARTOW AVENUE
3       BRYANT AVENUE   BRONX    EDGEWATER ROAD  BRUCKNER BOULEVARD
4  BRUCKNER BOULEVARD   BRONX      WENNER PLACE        BRUSH AVENUE
493

The bicycle lanes are defined by four features:

  1. Which street they're on (ON STREET NAME)
  2. Which cross street they begin on (BEGINS)
  3. Which cross street they end on (ENDS)
  4. The borough they're in (BOROUGH)

By using these four values, we'll match them to the location of the collisions and add a 'HAS_BIKE_LANE' column to the collisions data frame, indicating whether the collision occurred on a street that has a bike lane.


In [9]:
def merge_bicycle_lanes(df1, df2, columns):
    # Populate collisions with bicycle lane data by matching on 'columns'
    result = pd.merge(df1, df2, how='left', on=columns)
    
    # Create a 'HAS_BIKE_LANE' column to indicate the presence of a bike lane
    result = result.assign(HAS_BIKE_LANE = result['BEGINS'].notnull().astype(int))
    
    # Drop unneeded columns
    result = result.drop('BEGINS', axis=1)
    result = result.drop('ENDS', axis=1)
    
    # Multiple bike lanes can be on the same street, disregard that fact
    result = result.drop_duplicates()
    
    # Reset index to account for duplicates
    result = result.reset_index(drop=True)
    
    return result

In [10]:
collision = merge_bicycle_lanes(collision, bicycle_lanes, ['ON STREET NAME', 'BOROUGH'])

print collision.head()
print collision.dtypes
print len(collision)


     BOROUGH  ZIP CODE   LATITUDE  LONGITUDE     ON STREET NAME  \
0  MANHATTAN     10075  40.775640 -73.960511     EAST 79 STREET   
1     QUEENS     11420  40.681260 -73.818679         122 STREET   
2   BROOKLYN     11217  40.684052 -73.977458    ATLANTIC AVENUE   
3      BRONX     10472  40.832281 -73.871304       CROES AVENUE   
4     QUEENS     11357  40.786854 -73.822349  PARSONS BOULEVARD   

  CROSS STREET NAME  NUMBER OF PERSONS INJURED  NUMBER OF PERSONS KILLED  \
0       PARK AVENUE                          0                         0   
1        111 AVENUE                          0                         0   
2   FLATBUSH AVENUE                          0                         0   
3   EAST 172 STREET                          1                         0   
4         14 AVENUE                          2                         0   

   NUMBER OF PEDESTRIANS INJURED  NUMBER OF PEDESTRIANS KILLED      ...        \
0                              0                             0      ...         
1                              0                             0      ...         
2                              0                             0      ...         
3                              0                             0      ...         
4                              0                             0      ...         

   LARGE COM VEH(6 OR MORE TIRES)  PASSENGER VEHICLE  PICK-UP TRUCK  \
0                               0                  1              0   
1                               0                  0              0   
2                               0                  1              0   
3                               0                  1              0   
4                               0                  1              0   

   SMALL COM VEH(4 TIRES)  SPORT UTILITY / STATION WAGON  TAXI  VAN  \
0                       0                              0     0    0   
1                       0                              1     0    0   
2                       0                              0     0    0   
3                       0                              0     0    0   
4                       0                              0     0    0   

   MOTORCYCLE  SEVERITY SCORE  HAS_BIKE_LANE  
0           0               1              0  
1           0               1              0  
2           0               1              1  
3           0               3              0  
4           0               5              1  

[5 rows x 33 columns]
BOROUGH                            object
ZIP CODE                            int64
LATITUDE                          float64
LONGITUDE                         float64
ON STREET NAME                     object
CROSS STREET NAME                  object
NUMBER OF PERSONS INJURED           int64
NUMBER OF PERSONS KILLED            int64
NUMBER OF PEDESTRIANS INJURED       int64
NUMBER OF PEDESTRIANS KILLED        int64
NUMBER OF CYCLIST INJURED           int64
NUMBER OF CYCLIST KILLED            int64
NUMBER OF MOTORIST INJURED          int64
NUMBER OF MOTORIST KILLED           int64
UNIQUE KEY                          int64
NIGHT_TIME                          int64
MORNING_RUSH                        int64
DAY_TIME                            int64
EVENING_RUSH                        int64
WEEKDAY                             int64
WEEKEND                             int64
BICYCLE                             int64
BUS                                 int64
LARGE COM VEH(6 OR MORE TIRES)      int64
PASSENGER VEHICLE                   int64
PICK-UP TRUCK                       int64
SMALL COM VEH(4 TIRES)              int64
SPORT UTILITY / STATION WAGON       int64
TAXI                                int64
VAN                                 int64
MOTORCYCLE                          int64
SEVERITY SCORE                      int64
HAS_BIKE_LANE                       int64
dtype: object
577056

Evaluation

We now need to find the areas in which collisions occur frequently, i.e. the collision-prone areas of NYC. Each collision carries latitude & longitude information, which is too granular for identifying areas, while the borough and ZIP code are not granular enough. What we need is to define 'clusters' of collision locations that we can treat as areas. The k-means algorithm is a natural fit for this.

Usually, we would optimize the number of means for a good balance of error & generalizability. While we could do that here as well, we can instead choose the number of means manually to control how granular our zoning is: a single mean would, naturally, cover the entire NYC metropolitan area without providing any additional information. By visually inspecting the results for several numbers of means, we can find a number of zones that translates sensibly onto the layout of the NYC metropolitan area.
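
For completeness, here is a sketch of the usual, more systematic way to pick the number of means (the 'elbow method'). The sample size and range of k values are arbitrary choices of ours:


In [ ]:
# Elbow method sketch: run k-means for a range of k values and plot the
# mean distance from each point to its nearest centroid (the 'distortion'
# scipy's kmeans returns). The 'elbow' of the curve marks a reasonable
# error/granularity trade-off. We subsample, since running k-means ~20
# times on all ~577k collisions would be slow.
location_sample = collision.loc[:, ['LATITUDE', 'LONGITUDE']].sample(20000).values
k_values = range(1, 21)
distortions = []
for k in k_values:
    _, distortion = kmeans(location_sample, k)
    distortions.append(distortion)
plt.plot(k_values, distortions, marker='o')
plt.xlabel('Number of centroids (k)')
plt.ylabel('Mean distance to nearest centroid')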


Let's start with some helper code that will find those centers:


In [11]:
def get_collision_centers(collision, num_centers):
    """
    Runs k-means, finds the centers that correspond to collisions and returns them,
    along with each collision's label (which center it belongs to).
    """
    location_data = collision.loc[:, ['LATITUDE', 'LONGITUDE']]
    centroids, labels = kmeans2(location_data, num_centers)
    return centroids, labels

Note that k-means here operates on raw latitude/longitude pairs with Euclidean distance. Over an area as compact as NYC this is a reasonable approximation, although a degree of longitude at this latitude is about three-quarters the length of a degree of latitude, so clusters are slightly stretched east-west.

We will also use a function that takes collisions and plots them along with some of their features (such as the cluster centers and each collision's cluster membership). Effectively visualizing this kind of data is of paramount importance, since the evaluation depends on observing collision hotspots.

The function plots a list of collisions on an interactive map of New York City. We tell it which collisions to plot, whether to show their severity, whether to show their cluster membership (obtained through k-means), etc.


In [12]:
def plot_collisions(collision, filter_features=[], show_severity=False, centroids=None, labels_of_collisions=None,
                    color_collisions_based_on_labels=False, only_show_num_collisions=None, 
                    open_map_after_plot=False, output_file_name='map', lat=40.77, lon=-74.009725, zoom=12):
    # Create the output files
    create_map_html_file(output_file_name)
    create_map_script_file(output_file_name, lat, lon, zoom)
    
    collisions_to_plot = collision.copy().reset_index(drop=True)
    
    if labels_of_collisions is not None:
        # Append the labels
        collisions_to_plot = collisions_to_plot.assign(LABELS = labels_of_collisions)
    
    # Filter the collisions based on the list of features that were given
    for feature in filter_features:
        collisions_to_plot = collisions_to_plot.loc[collisions_to_plot[feature] == 1, :].reset_index(drop=True)

    # After filtering, throw away the unneeded features
    if labels_of_collisions is not None:
        collisions_to_plot = collisions_to_plot.loc[:, ['LATITUDE', 'LONGITUDE', 'SEVERITY SCORE', 'LABELS']]
    else:
        collisions_to_plot = collisions_to_plot.loc[:, ['LATITUDE', 'LONGITUDE', 'SEVERITY SCORE']]
        
    # Randomly pick some number of collisions if specified
    num_collisions_total = collisions_to_plot.shape[0]
    if only_show_num_collisions and only_show_num_collisions < num_collisions_total:
        random_indices = np.random.choice(num_collisions_total, only_show_num_collisions, replace=False)
        collisions_to_plot = collisions_to_plot.loc[random_indices, :].reset_index(drop=True)
    
    if show_severity:  # Plot with heatmap
        # Get the color range
        unique_severity_scores = sorted(collisions_to_plot.loc[:, 'SEVERITY SCORE'].unique())
        num_unique_severity_scores = len(unique_severity_scores)
        colors = list(Color('Yellow').range_to('Red', num_unique_severity_scores))
        
        # Plot each severity level with its own color
        for index, score in enumerate(unique_severity_scores):
            collisions_with_score = collisions_to_plot.loc[collisions_to_plot['SEVERITY SCORE'] == score, :]
            append_points_to_script(output_file_name, collisions_with_score, colors[index], 50)
    elif labels_of_collisions is not None:  # Plot with color code
        # Plot each center's points
        color_to_use = Color('Blue')
        for center in xrange(len(centroids)):
            if color_collisions_based_on_labels and len(centroids) <= len(prime_colors):
                color_to_use = Color(prime_colors[center])
            collisions_with_label = collisions_to_plot.loc[collisions_to_plot.loc[:, 'LABELS'] == center, :]
            append_points_to_script(output_file_name, collisions_with_label, color_to_use, 50)
    else:  # Plot simple
        append_points_to_script(output_file_name, collisions_to_plot, Color('Blue'), 50)
            
    if centroids is not None:
        # Plot the centroids
        append_points_to_script(output_file_name, centroids, Color('Orange'), 150)
        
    # Generate map URL
    map_url = get_map_output_url(output_file_name)
    
    # Optionally open it in a browser window
    if open_map_after_plot:
        webbrowser.open(map_url)
    
    return map_url

Walkthrough

With this function, we started combining features, looking for visible patterns and interesting facts. During this process, we decided to focus on collisions that occur on streets with bike lanes. We know that traffic volume differs between weekdays and the weekend, so we decided to start with night-time collisions involving taxis in the borough of Manhattan. Let's see if we find something interesting:


In [13]:
# Only consider the collisions that took place in Manhattan
collisions_in_manhattan = collision.loc[collision['BOROUGH'] == 'MANHATTAN', :]

# Define features to filter the collisions by
filter_features = ['NIGHT_TIME', 'TAXI', 'HAS_BIKE_LANE']

# Plot the specified collisions and print its URL
print plot_collisions(collisions_in_manhattan,                        # The list of collisions to plot
                      filter_features=filter_features + ['WEEKDAY'],  # Which features to filter the collisions by
                      show_severity=True,                             # Show the severity of collisions
                      output_file_name='TAXI_WEEKDAY_BIKE_NIGHT',     # Name of the output HTML & JS map files
                      open_map_after_plot=True)                       # Open the map in a browser window

# Plot the specified collisions and print its URL
print plot_collisions(collisions_in_manhattan,                        # The list of collisions to plot
                      filter_features=filter_features + ['WEEKEND'],  # Which features to filter the collisions by
                      show_severity=True,                             # Show the severity of collisions
                      output_file_name='TAXI_WEEKEND_BIKE_NIGHT',     # Name of the output HTML & JS map files
                      open_map_after_plot=True)                       # Open the map in a browser window
Taxis, Weekdays, at Night (with bike lanes)
Taxis, the Weekend, at Night (with bike lanes)

The results above show that the severity of night-time collisions on streets with bike lanes tends to be greater during the week than on the weekend. This could merely reflect the amount of car activity generated by jobs in the area. One noticeable exception is an apparent increase on weekends in the severity of collisions near the Time Warner Center, at the south-west corner of Central Park.


Let's look at the same features, but at daytime rather than nighttime:


In [14]:
filter_features = ['DAY_TIME', 'TAXI', 'HAS_BIKE_LANE']

print plot_collisions(collisions_in_manhattan, 
                      filter_features=filter_features + ['WEEKDAY'],
                      show_severity=True,          
                      output_file_name='TAXI_WEEKDAY_BIKE_DAY',
                      open_map_after_plot=True)

print plot_collisions(collisions_in_manhattan, 
                      filter_features=filter_features + ['WEEKEND'],
                      show_severity=True,          
                      output_file_name='TAXI_WEEKEND_BIKE_DAY',
                      open_map_after_plot=True)
Taxis, Weekdays, Daytime (with bike lanes)
Taxis, Weekends, Daytime (with bike lanes)

Similar to the results above, the severity of the collisions is greater on weekdays and more concentrated towards southern Manhattan; during the weekend, the pattern in that part of the borough appears to reverse. The south-west corner of Central Park remains a problem area.

These results suggest that collisions are more likely to happen during the night and close to Central Park or southern Manhattan. From this data alone we cannot determine what causes these hotspots. It may be that they are favorite places to ride a bike (like Central Park), or that they are important commercial or business centers. It is clear, however, that collisions in these parts of town are frequent during certain parts of the day or of the week.


Let's see if buses, a popular means of transportation in many areas, follow a similar pattern.


In [15]:
filter_features = ['DAY_TIME', 'BUS', 'HAS_BIKE_LANE']

print plot_collisions(collisions_in_manhattan, 
                      filter_features=filter_features + ['WEEKDAY'],
                      show_severity=True,
                      output_file_name='BUS_WEEKDAY_BIKE_DAY',
                      open_map_after_plot=True)

print plot_collisions(collisions_in_manhattan, 
                      filter_features=filter_features + ['WEEKEND'],
                      show_severity=True,
                      output_file_name='BUS_WEEKEND_BIKE_DAY',
                      open_map_after_plot=True)
Bus Collisions on weekdays during day (with bike lanes)
Bus Collisions on weekends during day (with bike lanes)

From the maps above, we see an obvious difference in the overall number of collisions. We can also observe an increase in the severity of the collisions during the week compared to the weekend.


As this walkthrough shows, it is quite easy to extract interesting information from the collision data through its features. We have observed a correlation between the severity of collisions and the area of Manhattan, the time of day, and the day of the week.


Identifying Collision Hotspots

Using our collision data, we can try to infer the areas where collisions concentrate. This information could help emergency response services position their units according to historical collision data, which is useful for at least two reasons:

  1. They can put units in the centers of collision hotspots, allowing faster access to future collisions.
  2. They can have fewer units, but better placed, which could lower the costs for the local government while maintaining quick and efficient response time.

Let's start by identifying some collision hotspots in the Manhattan borough:


In [16]:
# Run kmeans with 15 centroids, find 15 'collision hotspots'
centroids, labels = get_collision_centers(collisions_in_manhattan, 15)

# Plot the collisions, along with the centroids
print plot_collisions(collisions_in_manhattan,        # The list of collisions to plot
                      centroids=centroids,            # The centroids of collisions, chosen through kmeans
                      labels_of_collisions=labels,    # The list of which collisions belong to which centroid
                      only_show_num_collisions=2000,  # Randomly pick only some collisions to show, for a cleaner map
                      open_map_after_plot=True)       # Open the generated map in a browser window
15 Collision Hotspots of Manhattan

You can observe that a small area between 25th and 60th Streets, the Midtown Manhattan area, has as many collision hotspots as the entire northern part of Manhattan (including Uptown), and twice as many as Lower Manhattan. This suggests just how many more collisions happen in Midtown than in the rest of Manhattan.
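
To tie this back to the emergency response idea: a rough proxy for response quality is the distance between a collision and its assigned hotspot center. Below is a sketch of ours for that computation, using the centroids and labels from the cell above; we subsample because geopy's point-by-point distance calls are slow:


In [ ]:
# Mean distance (in miles) from a collision to its assigned hotspot
# center, estimated on a random sample of collisions
location_data = collisions_in_manhattan.loc[:, ['LATITUDE', 'LONGITUDE']].values
sample_indices = np.random.choice(len(location_data), 5000, replace=False)
mean_miles = np.mean([distance(tuple(location_data[i]), tuple(centroids[labels[i]])).miles
                      for i in sample_indices])
print 'Mean distance to assigned hotspot center: %.2f miles' % mean_miles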

We can focus on parts of New York City to get a more granular understanding of problematic areas. Let's take Midtown Manhattan as an example: we can run k-means only on the collisions that occur there and plot the result. To (roughly) select those collisions, we define a function that returns the collisions inside a circle:


In [17]:
def get_collisions_in_circle(collisions, latitude, longitude, radius_miles):
    location_data = collisions.loc[:, ['LATITUDE', 'LONGITUDE']]
    center = np.array([latitude, longitude])
    # Calculate the distance of each collision to the center
    distances = location_data.apply(lambda x: distance((x['LATITUDE'], x['LONGITUDE']), center).miles, axis = 1)
    # Return the collisions inside the circle
    return collisions[distances <= radius_miles].reset_index(drop=True)

We can now define a center and radius that (roughly) encircle the Midtown Manhattan area, run k-means on the collisions inside it to get that area's collision hotspots...


In [18]:
# The circle that roughly corresponds to Midtown Manhattan
center_lat = 40.7564028  # 40°45'23.05"N
center_lon = -73.9844611  # 73°59'04.06"W
radius = 1.08  # miles

# Get the collisions in midtown
collisions_in_midtown_manhattan = get_collisions_in_circle(collisions_in_manhattan, center_lat, center_lon, radius)

...and plot all of them:


In [19]:
# Run kmeans with 15 centroids, find 15 'collision hotspots'
centroids, labels = get_collision_centers(collisions_in_midtown_manhattan, 15)

# Plot these collisions
print plot_collisions(collisions_in_midtown_manhattan,
                      centroids=centroids,
                      labels_of_collisions=labels,
                      only_show_num_collisions=500,
                      lat=center_lat, lon=center_lon,  # Center the map to this latitude & longitude
                      zoom=14,                         # Zoom in to the map straight away
                      open_map_after_plot=True)

Perhaps unsurprisingly, we can see that a collision hotspot appears dead center on Times Square (the hotspot in the middle of the image):

15 Collision Hotspots of Midtown Manhattan

This seems to confirm something that most people who have seen Times Square and its traffic already suspect: it is a total mess that results in many collisions.

One can also observe a collision hotspot at the south-west corner of Central Park, which already came up in the severity analysis above as an area with higher-than-average collision severity.


Through this k-means analysis, we can get a good picture of the areas of New York City that see more collisions than the rest.

Prediction

We have spent time and effort extracting a number of features for each collision. The question remains, however: how well do these features describe a collision's severity? In other words, how well can we predict the severity of a collision from these features?

Let's start by establishing the baseline: most collisions in the dataset involve no injuries or deaths, so always predicting a severity of 1 correctly classifies ~81% of collisions:


In [20]:
score = sum(collision['SEVERITY SCORE'] == 1) / float(len(collision))
accuracy = '{:.4f}%'.format(score * 100)
print 'Baseline:', accuracy


Baseline: 81.3030%
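
For context, the full distribution of severity scores (a quick check of ours) shows the heavy skew towards 1 that makes this constant-prediction baseline so strong:


In [ ]:
# Tabulate the most common severity scores
print collision['SEVERITY SCORE'].value_counts().head(10)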

Let's see how well some individual features predict the severity, so we can compare them against the combined feature set. We'll start with the time of day.

It might seem reasonable to expect the most severe collisions to happen during rush hours, which would make the time of day a good predictor of collision severity. We can test this hypothesis by randomly choosing a subset of our collisions, training a logistic regression on just the 'time of day' features, and evaluating it on a randomly chosen validation set.


In [21]:
def get_training_sets(collision):
    """Gets a list of collisions and returns the set of training features and 
    labels from them. The severity score of each collision are their respective 
    label.
    
    Parameters:
        X (Pandas DataFrame): a dataframe that contains collisions
        
    Returns:
        X, Y (Pandas DataFrames): tuple of DataFrames that have features and labels of 
                                  collisions, respectively
    """
    X = collision
    Y = collision['SEVERITY SCORE']
    
    # Convert BOROUGH to one-hot encoding
    if 'BOROUGH' in X.columns:
        borough_onehot = pd.get_dummies(X['BOROUGH']).astype(int)
        X = pd.concat([X, borough_onehot], axis=1)
    
    # Drop unneeded columns
    columns_to_drop = ['BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE', 
                       'ON STREET NAME', 'CROSS STREET NAME', 'SEVERITY SCORE', 
                       'UNIQUE KEY', 'NUMBER OF PERSONS INJURED', 
                       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED', 
                       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED', 
                       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED', 
                       'NUMBER OF MOTORIST KILLED']
    for column in columns_to_drop:
        if column in X.columns:
            X = X.drop(column, axis=1)
    
    return X, Y

# The number of collisions we have
num_collisions_total = collision.shape[0]
# Number of collisions we want in our training set
num_collisions_training = 40000
# Number of collisions we want in our test/validation set
num_collisions_test = 4000

# Get a random choice of collisions
collisions_shuffled = collision.loc[np.random.choice(num_collisions_total, 
                                                     (num_collisions_training + num_collisions_test), 
                                                     replace=False), :]
training_collisions = collisions_shuffled[:num_collisions_training]
test_collisions = collisions_shuffled[num_collisions_training:]
X_tr, Y_tr = get_training_sets(training_collisions)
X_te, Y_te = get_training_sets(test_collisions)

In [22]:
# Throw out all the features but the time of day
time_of_day_features = ['NIGHT_TIME', 'MORNING_RUSH', 'DAY_TIME', 'EVENING_RUSH']
X_tr_time = X_tr.loc[:, time_of_day_features]
X_te_time = X_te.loc[:, time_of_day_features]

# Train the classifier on the 'time of day' features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr_time, Y_tr)
score = classifier.score(X_te_time, Y_te)

# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'Just time of day:', accuracy


Just time of day: 81.4750%

As you can observe, the time of day wasn't much better at predicting the severity of collisions than the baseline.

Let's try another feature: the borough. We will see if the general area is a good indicator of the severity of collisions:


In [23]:
# Throw out all the features but the borough
borough_features = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
X_tr_borough = X_tr.loc[:, borough_features]
X_te_borough = X_te.loc[:, borough_features]

# Train the classifier on the 'borough' features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr_borough, Y_tr)
score = classifier.score(X_te_borough, Y_te)

# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'Just borough:', accuracy


Just borough: 81.4750%

As you can observe, the borough wasn't much better at predicting the severity of collisions than the baseline either. It's as bad as just using the time of day.

Now, let's use all of our features and see how good they are at predicting the severity of collisions:


In [24]:
# Train the classifier on all the features
classifier = linear_model.LogisticRegressionCV()
classifier.fit(X_tr, Y_tr)
score = classifier.score(X_te, Y_te)

# Assess the score of the classifier through the validation set
accuracy = '{:.4f}%'.format(score * 100)
print 'All the features:', accuracy


All the features: 81.7000%

It seems these features weren't enough to predict the severity of collisions; we can't beat the baseline with them alone. There are two possible conclusions:

  1. Collision severity is completely random and follows no particular pattern.
  2. The features we have aren't enough to reliably predict the severity of collisions.

To understand which is the case, we could extract more features from this data, for example by using ZIP codes instead of boroughs, or by combining it with datasets from other sources. Ruling out the second possibility may not be as easy as it seems.
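
As an illustration of the first idea, here is a sketch of ours (with no claim about the resulting accuracy) that swaps in one-hot encoded ZIP codes as a finer-grained location feature:


In [ ]:
# One-hot encode ZIP codes; the prefixed dummy columns survive the column
# drops in get_training_sets (note the borough one-hots remain as well)
zip_onehot = pd.get_dummies(collision['ZIP CODE'], prefix='ZIP').astype(int)
collision_with_zip = pd.concat([collision, zip_onehot], axis=1)

# Re-sample training and test sets and retrain on the wider feature matrix
shuffled = collision_with_zip.sample(num_collisions_training + num_collisions_test)
X_tr_zip, Y_tr_zip = get_training_sets(shuffled[:num_collisions_training])
X_te_zip, Y_te_zip = get_training_sets(shuffled[num_collisions_training:])

zip_classifier = linear_model.LogisticRegressionCV()
zip_classifier.fit(X_tr_zip, Y_tr_zip)
print 'With ZIP code features:', '{:.4f}%'.format(zip_classifier.score(X_te_zip, Y_te_zip) * 100)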

Accurate severity prediction could give first-response units proactive information about the severity of collisions and allow them to respond with the appropriate amount of resources. For example, when the emergency services are notified of a collision, they could use the location, the time, and the types of vehicles involved to predict the severity before even arriving at the scene. They could then route 2 ambulances (instead of just 1) to the area before seeing the severity themselves, potentially saving lives.
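
To make this concrete, here is a sketch of ours of what such a dispatcher-side query could look like, using the all-features classifier trained above on a hypothetical incoming report:


In [ ]:
# Build a single feature row for a hypothetical report: a taxi collision
# in Manhattan during the weekday morning rush, on a street with a bike
# lane, then ask the trained classifier for a severity score
incoming = pd.DataFrame(np.zeros((1, len(X_tr.columns)), dtype=int), columns=X_tr.columns)
incoming.loc[0, ['MORNING_RUSH', 'WEEKDAY', 'TAXI', 'HAS_BIKE_LANE', 'MANHATTAN']] = 1
print 'Predicted severity score:', classifier.predict(incoming)[0]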